This file will pick up on from the Kenya round 1 file and finish the analysis. The cleaning of the data up through the round 1 data (baseline and round 1) happens in the kenya_round_1 folder. Going forward, the cleaning of new data will happen in the shs_cleaning.R file, which I’ll source here, and then the analysis will happen here.
Analyze the soil health study data. This is the intention to treat (ITT) model - we are counting 1AF plots as treatment even if a farmer did not plant 1AF inputs on the plot.
# load the latest kenya data from the cleaning file >> or maybe do the combining there.
#source("shs_cleaning.R")
keDat <- readRDS("ke_cleaned_combined_fieldDat.rds")
rwDat <- readRDS("rw_cleaned_combined_fieldDat.rds")
It’s hard to know where to draw the line between cleaning and analysis. I’m going to keep this code here even though it’s technically additional data cleaning
Check that we’re actually dealing with clients for which we have multiple records
table(table(rwDat$sample_id)==5)
##
## FALSE TRUE
## 157 2328
So we have some clients for which we don’t have 3 records, one for each survey. Find out what the deal is.
table(table(rwDat$sample_id))
##
## 1 2 3 4 5
## 1 6 110 40 2328
Hm. So the numbers less than 5 could be that we’re just not finding people every year. But the 1s and 2s are particularly strange. Let’s check those out. Here’s a helpful link
singleSurvey <- names(which(table(rwDat$sample_id) == 1))
rwDat %>%
filter(sample_id == singleSurvey)
Looks like this person was only surveyed in the first season. Okay. Check out the other seasons
twoSurveys <- names(which(table(rwDat$sample_id) == 2))
rwDat %>%
filter(sample_id %in% twoSurveys)
Okay, these are from the recent round but they don’t have any previous surveys which is odd. We shouldn’t be adding new farmers into the survey. These can be dropped for simplicity. I can follow up with Eric and Cyprien to find out the issue but I’m just going to drop them for now.
rwDat <- rwDat %>%
filter(!sample_id %in% twoSurveys)
fourSurveys <- names(which(table(rwDat$sample_id)==4))
rwDat %>%
filter(sample_id %in% fourSurveys)
Okay. There are enough surveys in the 3 and 4 category that they’re at least plausible from a data quality and survey strategy perspective. I’ll leave them as they are.
Check that there isn’t something systematic in these surveys. If so, describe what it is.
Create a record of how many farmers are joining and leaving Tubura between the baseline through the current survey round. These numbers only reflect the changes from the last round.
This also should account for clients that we miss in each season so we can say what the attrition has been season to season. Also, why don’t we have 16A results for Rwanda. Find out what’s happening there.
clientCount <- function(dat){
# this assumes season, d_client, and sample_id remain the names of
# the key variables. They should for the data to match.
# this funciton tablulates the number of clients in each season
iniTab <- dat %>%
filter(!is.na(d_client)) %>%
group_by(season, d_client) %>%
tally() %>%
gather(key, value, -c(season, d_client)) %>%
spread(d_client, value) %>%
rename(control = `0`,
client = `1`) %>%
mutate(control = as.numeric(control),
client = as.numeric(client)) %>%
rowwise() %>%
mutate(total = sum(control, client)) %>%
ungroup()
percentDiff <- iniTab %>%
mutate(pct_change = (((total - lead(total))/total) * 100))
return(percentDiff)
}
rw.attrition.tab <- clientCount(rwDat)
ke.attrition.tab <- clientCount(keDat)
attritionRate <- function(org, new){
return((org - new)/org * 100)
}
attritionRate(2028, 1935) # kenya
## [1] 4.585799
attritionRate(2439, 2409) # rwanda
## [1] 1.230012
attritionRate(4467, 4344) # total
## [1] 2.753526
ggplot(rwDat, aes(x = season, y = ph, group = d_client)) + geom_boxplot()
ggplot(keDat, aes(x = can.acre)) +
geom_histogram() +
scale_x_continuous(limits = c(0,200)) +
labs(title = "Set limits to 0 and 100 for CAN")
ggplot(keDat, aes(x = dap.acre)) +
geom_histogram() +
scale_x_continuous(limits = c(0,200)) +
labs(title = "Set limits to 0 and 100 for DAP")
#export data for Diego >> this is the easiest way to have the location and soil information together. The soil data is also available through the warehouse but it isn't easily connected to the survey information.
rwDat %>%
select(sample_id, season, district, cell_field, village, ph, calcium, magnesium, organic.carbon, total.nitrogen) %>%
write_csv(., "shs_analysis/output/rwanda_soil_values_travertine_impact.csv")
Check values for upcoming final round of Kenya SHS data collection
This is ideally a grouped boxplot. Fix this when I have internet. This would show the values by season and client type.
See sketch of SHS report. Remember that sameStatus are the farmers that kept their status between baseline and endline. The two models of interest are:
Helpful link for executing code in parallel
Work before 3/18/19
Before running the full models, I’m going to take a closer look at the nitrogen data to make sure we can explain the negative effects we’re seeing in Kenya in the models below. There isn’t a clear agronomic explanation for the negative effect and as a consequence we’re put in a bind in our reporting to explain either the outcome or the modeling.
Let’s start with looking at the relationship between nitrogen and other features to make sure we’re seeing what expect. Here are the notes from Step:
Click here for compost / acre vs. nitrogen rate effects. The short is that there isn’t a clear trend.
Nitrogen application by yield output. The yield values are not making sense between survey rounds. I’ll come back to this.
Look at rotations relative to N rate by 1AF status. It does appear that plots that non-1AF plots receive more legume intercrops. This makes sense given that the 1AF core package emphasizes maize monocrops.
Would love to hear your thoughts - entirely possible I’m missing something here. The most critical question here is whether (i) we’re not applying enough C relative to N, (ii) we’re just not applying enough N. I suspect maybe the answer is both, but I’m wary of assuming it’s the latter and making things worse if the key driver is unbalanced C:N ratios.
3/18/19 forward
Step Aston of ART wants follow up on the possibility that 1AF is driving down N rates through repeated 1AF culti ation. This document lists the key questions we want to answer to understand what might be happening in the data. Those questions are (see document for more specifics):
First, let’s check some summary tables of the change in N between 2015 and 2017 between treatment and control farmers.
keDat %>%
group_by(sample_id,season) %>%
mutate(count = n()) %>%
arrange(season) %>%
mutate(d_client = last(d_client)) %>%
ungroup() %>%
filter(count < 2) %>%
dplyr::select(sample_id, d_client, season, total.nitrogen) %>%
spread(key = season, value = total.nitrogen) %>%
mutate(total.n.difference = `2017` - `2015`) %>%
filter(!is.na(total.n.difference)) %>%
filter(!is.na(`2016`)) %>%
group_by(d_client) %>%
summarize(
"Average N Difference" = mean(total.n.difference),
"Median N Difference" = median(total.n.difference)
) %>%
kable(.,align=rep('c', 5), digits=3, caption="N Difference (2017N - 2015N) Between 2017 Clients and Non-Clients")
| d_client | Average N Difference | Median N Difference |
|---|---|---|
| 0 | -0.010 | -0.008 |
| 1 | -0.011 | -0.010 |
Overall, it appears that both control and treatment clients on average experied a decrease in N over time. Farmers enrolled in 1AF in 2017 had a slightly steeper increase in N than control farmers.
However, this includes farmers who potentially swtiched their status to 1AF/non-1AF.
Let’s examine only farmers who maintained their control/treatment status across the 3 seasons:
keDat %>%
group_by(sample_id,season) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count < 2) %>%
group_by(sample_id,d_client) %>%
mutate(sample_count = n()) %>%
ungroup() %>%
filter(sample_count == 3) %>%
dplyr::select(sample_id, d_client, season, total.nitrogen) %>%
spread(key = season, value = total.nitrogen) %>%
mutate(total.n.difference = `2017` - `2015`) %>%
filter(!is.na(total.n.difference)) %>%
group_by(d_client) %>%
summarize(
"Average N Difference" = mean(total.n.difference),
"Median N Difference" = median(total.n.difference)
) %>%
kable(.,align=rep('c', 5), digits=3, caption="N Difference (2017N - 2015N) Between 2017 Clients and Non-Clients")
| d_client | Average N Difference | Median N Difference |
|---|---|---|
| 0 | -0.010 | -0.008 |
| 1 | -0.011 | -0.010 |
There does not seem to be much of a difference between clients who maintained and clients who changed their N status.
Let’s also group this by the relative field fertility data we collected in 2017.
keDat %>%
group_by(sample_id,season) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count < 2) %>%
group_by(sample_id,d_client) %>%
mutate(sample_count = n()) %>%
arrange(season) %>%
mutate(fertile.comparison = last(fertile.comparison)) %>%
ungroup() %>%
filter(sample_count == 3) %>%
dplyr::select(sample_id, d_client, fertile.comparison,
season, total.nitrogen) %>%
spread(key = season, value = total.nitrogen) %>%
mutate(total.n.difference = `2017` - `2015`) %>%
filter(!is.na(total.n.difference)) %>%
group_by(d_client, fertile.comparison) %>%
summarize(
"Average N Difference" = mean(total.n.difference),
) %>%
spread(key = fertile.comparison, value = `Average N Difference`) %>%
kable(.,align=rep('c', 5), digits=3, caption="N Difference (2017N - 2015N) Between 2017 Clients and Non-Clients by Field Fertility")
| d_client | average | lessfertile | morefertile |
|---|---|---|---|
| 0 | -0.010 | -0.010 | -0.012 |
| 1 | -0.011 | -0.011 | -0.011 |
It appears that OAF clients experienced slightly more N loss in their average/less fertile fields, and slightly less in their more fertile fields. Let’s also look at the count of these values:
keDat %>%
group_by(sample_id,season) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count < 2) %>%
group_by(sample_id,d_client) %>%
mutate(sample_count = n()) %>%
arrange(season) %>%
mutate(fertile.comparison = last(fertile.comparison)) %>%
ungroup() %>%
filter(sample_count == 3) %>%
dplyr::select(sample_id, d_client, fertile.comparison,
season, total.nitrogen) %>%
spread(key = season, value = total.nitrogen) %>%
mutate(total.n.difference = `2017` - `2015`) %>%
filter(!is.na(total.n.difference)) %>%
group_by(d_client, fertile.comparison) %>%
summarize(
Total = n()
) %>%
spread(key = fertile.comparison, value = Total) %>%
kable(.,align=rep('c', 5), digits=3, caption="Total 2017 Clients and Non-Clients by Field Fertility")
| d_client | average | lessfertile | morefertile |
|---|---|---|---|
| 0 | 552 | 149 | 79 |
| 1 | 430 | 48 | 128 |
One Acre Fund clientys mostly planted in fertile/average fields, compared to control clients.
This suggests that field fertility might be an important control variable.
The big takeaway for me from this plot is that there are regular spikes at certain values. That seems odd for a system that should be pretty continuous. Let’s see if that plays out evenly across seasons.
keDat %>%
ggplot(., aes(x = total.nitrogen)) +
geom_histogram(bins = 100) +
geom_density()
keDat %>%
ggplot(., aes(x = total.nitrogen)) +
geom_histogram() +
geom_vline(xintercept = mean(keDat$total.nitrogen, na.rm = T)) +
facet_grid(~ season)
# ggplot() +
# geom_histogram(data = filter(keDat, keDat$season == 2015), aes(x = total.nitrogen, fill = 'red'), alpha = 0.5) +
# geom_histogram(data = filter(keDat, keDat$season == 2016), aes(x = total.nitrogen, fill = 'blue'), alpha = 0.5) +
# geom_histogram(data = filter(keDat, keDat$season == 2017), aes(x = total.nitrogen, fill = 'green'), alpha = 0.5) +
# theme(legend.position = 'none')
This is what we’re observing. There’s a clear downward trend across seasons with perhaps a slight shift downward for
keDat %>%
#select(season, total.nitrogen, d_client) %>%
ggplot(., aes(x = season, y = total.nitrogen, fill = as.factor(d_client))) +
geom_boxplot() +
theme(legend.position = 'bottom') +
labs(title = "Total N by season and client status", subtitle ="Strong seasonal trend with perhaps a slight client trend downward",
x = "Season", y = "Total N", fill = "Client Status")
keDat %>%
filter(!is.na(d_client)) %>%
group_by(season, d_client) %>%
summarize(mean = round(mean(total.nitrogen, na.rm = T),4)) %>%
kable(caption = "Downward average N by season") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| season | d_client | mean |
|---|---|---|
| 2015 | 0 | 0.1577 |
| 2015 | 1 | 0.1579 |
| 2016 | 0 | 0.1499 |
| 2016 | 1 | 0.1484 |
| 2017 | 0 | 0.1478 |
| 2017 | 1 | 0.1466 |
1AF clients are slightly below comparison farmers in terms of total nitrogren but the difference is ever so slight. Perhaps in the context of soil chemistry these are big differencs though!
This code also addresses identified erroneous variables.
keDat <- keDat %>%
mutate(organic.carbon = ifelse(organic.carbon == 0 & total.nitrogen > 0, NA, organic.carbon))
keDat %>%
ggplot(., aes(x = organic.carbon, y = total.nitrogen, color = as.factor(d_client))) +
geom_point(alpha = .3) + geom_smooth(method = "lm") +
labs(title = "N vs. C - what we'd expect", x = "Total Carbon", y = "Total Nitrogen",
subtitle = "And no clear client, non-client trend", color = "1AF client")
The lines are practically overlapping for the N/C plots between clients / non-clients.
keDat %>%
filter(!is.na(d_client)) %>%
ggplot(., aes(x = organic.carbon, y = total.nitrogen, color = as.factor(d_client))) +
geom_point(alpha = .3) + geom_smooth(method = "lm") +
labs(title = "N vs. C by season - again what we'd expect", x = "Organic C", y = "Total N", color = "1AF client",
subtitle = "and no clear client, non-client trend") +
facet_grid(~ season, scales = "fixed")
This is the same table as above but shows carbon level by season and client status. We don’t see the same gap widening between client and non-client plots that we see with nitrogen.
keDat %>%
filter(!is.na(d_client)) %>%
group_by(season, d_client) %>%
summarize(mean = round(mean(organic.carbon, na.rm = T),4)) %>%
kable(caption = "Carbon levels trend together by client status over seasons") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| season | d_client | mean |
|---|---|---|
| 2015 | 0 | 2.1606 |
| 2015 | 1 | 2.1700 |
| 2016 | 0 | 1.8588 |
| 2016 | 1 | 1.8506 |
| 2017 | 0 | 1.8625 |
| 2017 | 1 | 1.8632 |
Interestingly, like nitrogen, the largest decrease in carbon took place between 2015 and 2016.
if you applied compost since there are so many 0s and removing super large values
There’s no clear visual trend between compost application and nitrogen rates.
keDat %>%
filter(compost.acre > 0 & compost.acre < 1000) %>%
ggplot(., aes(x = compost.acre, y = total.nitrogen,
color = as.factor(d_client))) +
geom_jitter(alpha = .2) + geom_smooth(method = "lm") +
facet_grid(~ season, scales = "fixed") +
labs(title = "N vs. compost",
subtitle = "Relationship less clear",
x = "Compost kg / acre", y = "")
Trend here seems similar for both clients/non-clients.
# dap 18% n, can 26%
keDat %>%
mutate(can_n = can.acre * 0.26,
dap_n = dap.acre * 0.18) %>%
rowwise() %>%
mutate(total_n_applied = sum(can_n, dap_n, na.rm = T)) %>%
filter(total_n_applied > 0) %>%
ggplot(., aes(x = total_n_applied, y = total.nitrogen,
color = as.factor(d_client))) +
geom_point(alpha = .2) + geom_smooth(method = "lm") +
facet_wrap(~ season, scales ="fixed") +
labs(title = "Applied N versus TSN",
subtitle = "No strong relationship",
x = "Total Applied N (kgs)",
y = "Total Soil Nitrogen",
color = "Client")
Interpretation: There doesn’t seem to be a strong relationship between total N applied and total nitrogen in the soil. It’s not clear to me what we should have expected from this relationship but in general I don’t see anything odd or a clear difference between the 1AF plots and the treatment plots.
There are also some strange outliers in the 2015 data.
unique_seasons <- unique(keDat$season)
nitrogen_v_compost <- lapply(unique_seasons, function(year){
return(keDat %>%
filter(season == year) %>%
mutate(can_n = can.acre * 0.26,
dap_n = dap.acre * 0.18) %>%
rowwise() %>%
mutate(total_n_applied = sum(can_n, dap_n, na.rm = T)) %>%
filter(total_n_applied > 0)) %>%
filter(compost.acre < 2000 & compost.acre > 0)
})
lapply(nitrogen_v_compost, function(x){
ggplot(x, aes(x = total_n_applied, y = compost.acre, color = as.factor(d_client))) +
geom_point() +
#facet_wrap(~ d_client, scales = 'free') +
geom_smooth(method = 'lm') +
labs(title = paste("Total N applied / acre vs. compost / acre in",
unique(x$season)),
subtitle = "Compost / acre less than 2000 kgs",
x = "Total kg N / acre",
y = "Total kg compost / acre",
color = "Clinet") +
xlim(0, 150) +
ylim(0,1000)
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation I don’t see a strong difference across seasons between total N applied and total compost applied. The trend lines seem generally the same or at least visually not so strong to suggest that 1AF plots and control plots are having different experiences. The 2017 plot indicates a slightly flatter relationship between applied N and applied C on 1AF plots but this is not super dramatic.
applied_c_vs_soil_n <- lapply(unique_seasons, function(year){
return(keDat %>%
filter(season == year) %>%
filter(compost.acre < 2000 & compost.acre > 0))
})
lapply(applied_c_vs_soil_n, function(x){
ggplot(x, aes(x = compost.acre, y = total.nitrogen,
color = as.factor(d_client))) +
geom_point() +
xlim(0,2000) +
ylim(.08,.26) +
geom_smooth(method = 'lm') +
labs(title = paste("Applied C vs. soil N in", unique(x$season)),
subtitle = "No clear difference",
x = "Total kg compost / acre",
y = "Total soil N (%))",
color = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation As with the applied N vs. applied C, I don’t see a clear difference in outcome between 1AF and control plots when we look at applied C and soil N.
Next, I will look at the ratio of applied C/N by season to confirm that 1AF farmers are not applying disproportionate N.
n_c_ratio_v_soil <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year) %>%
mutate(can_n = can.acre * 0.26,
dap_n = dap.acre * 0.18) %>%
rowwise() %>%
mutate(total_n_applied = sum(can_n, dap_n, na.rm = T)) %>%
filter(total_n_applied > 0) %>%
filter(compost.acre < 2000 & compost.acre > 0) %>%
mutate(n_c_ratio = total_n_applied/compost.acre) %>%
filter(n_c_ratio < 2)
)
})
lapply(n_c_ratio_v_soil, function(x){
ggplot(x, aes(x = n_c_ratio, y = total.nitrogen,
color = as.factor(d_client))) +
geom_point() +
xlim(0,2) +
ylim(.08, .27)+
geom_smooth(method = 'lm') +
labs(title = paste("Applied N:C to TSN in", unique(x$season)),
subtitle = "N:C Ratio < 2",
x = "Applied C / Applied N ",
y = "Total soil N (%))",
color = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation: There do not seem to be clear differences here either - the confidence intervals for the lines of best fit clearly overlap across all seasons.
Let’s also do a quick check to make sure that the noise in applied compost are not driving these lack of differences.
applied_c_dummy <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year) %>%
filter(compost.acre < 2000) %>%
mutate(applied.compost = ifelse(compost.acre > 0, "Yes", "No")))
})
lapply(applied_c_dummy, function(x){
ggplot(x, aes(x = applied.compost, y = total.nitrogen,
fill = as.factor(d_client))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, .3)) +
labs(title = paste("Applied Compost in ", unique(x$season)),
subtitle = "Compost / acre less than 2000 kgs",
x = "Applied Compost",
y = "Total soil N (%))",
fill = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation: It looks like for some reason 1AF farmers applying compost are getting lower N relative to control farmers – especially in 2017.
applied_c_q_dummy <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year))
})
lapply(applied_c_q_dummy, function(x){
ggplot(x, aes(x = compost.quality, y = total.nitrogen,
fill = as.factor(d_client))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, .3)) +
labs(title = paste("Applied Compost Quality in ", unique(x$season)),
x = "Applied Compost",
y = "Total soil N (%))",
fill = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation: It looks like for some reason 1AF farmers applying compost are getting lower N relative to control farmers – especially in 2017.
fert_application_rates <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year) %>%
mutate_at(vars(can.main, dap.main, npk.main, urea.main),
list(~ifelse(is.na(.), 0, .))) %>%
mutate(CAN = ifelse(can.main > 0, 1, 0)) %>%
mutate(DAP = ifelse(dap.main > 0, 1, 0)) %>%
mutate(NPK = ifelse(npk.main > 0, 1, 0)) %>%
mutate(Urea = ifelse(urea.main > 0 , 1, 0)) %>%
group_by(d_client, season) %>%
summarize(
`% Applying CAN` = sum(CAN) / n() * 100,
`% Applying DAP` = sum(DAP) / n() * 100,
`% Applying NPK` = sum(NPK) / n() * 100,
`% Applying Urea` = sum(Urea) / n() * 100
) %>%
ungroup() %>%
filter(!is.na(d_client)) %>%
gather(-d_client, -season, key = "Fertilizer", value = "% Applied"))
})
lapply(fert_application_rates, function(x){
ggplot(x, aes(x = Fertilizer, y = `% Applied`,
fill = as.factor(d_client))) +
geom_col(position = "dodge") +
labs(title = paste("% Applied Fertilizer in ",
unique(x$season)),
x = "Fertilizer Type",
y = "% Applied",
fill = "Client") +
ylim(0,100)
})
## [[1]]
##
## [[2]]
##
## [[3]]
One Acre Fund farmers are applying much more fertilizer compared to control farmers. In 2016/17, almost 100% of 1AF farmers apply basal fertilizer, and just over 75% apply top dress, compared to only 75% and 50% of control farmers respectively.
Not likely to be interpretable until yield data is cleaned up.
p_v_yields <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year) %>%
filter(yield.t.ha < 1000) %>%
mutate(npk_n = npk.acre * .05,
urea_n = urea.acre * 0.44,
dap_n = dap.acre * 0.46) %>%
rowwise() %>%
mutate(total_p_applied = sum(npk_n, dap_n, urea_n, na.rm = T)) %>%
filter(total_p_applied > 0))
})
lapply(p_v_yields, function(x){
ggplot(x, aes(x = total_p_applied, y = yield.t.ha,
color = as.factor(d_client))) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = paste("Applied P versus Yield in", unique(x$season)),
subtitle = "Yield < 1000 Kg/Ha",
x = "Applied Ph",
y = "Yield",
color = "Client") +
ylim(0,500) +
xlim(0,100)
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation: What the heck is happening in 2016?? But in 2015 and 2017 there is a steeper downward trend for clients than non-clients.
intercrop_check <- keDat %>%
mutate(legume_intercrop = ifelse(intercrop %in% c("beans", "groundnuts", "soya"), 1, 0)) %>%
filter(!is.na(sample_id))
intercrop_check %>%
group_by(sample_id, d_client) %>%
summarize(legume_count = sum(legume_intercrop, na.rm = T),
total.nitrogen.average = mean(total.nitrogen, na.rm = T)) %>%
filter(legume_count <= 3) %>%
ggplot(., aes(x = as.factor(legume_count), y = total.nitrogen.average, fill = as.factor(d_client))) +
geom_boxplot() +
labs(title = "Legume intercropping (beans, groundnuts, soya) and nitrogen",
subtitle = "Control plots have slightly higher N across years of legume intercrops",
x = "Number of seasons of legume intercrops",
y = "Total soil N (%)", fill = "1AF status") +
theme_oaf()
The above plot shows N rates by the number of legume intercrops. It’s not a perfect view of what happens following a season in which a farmer rotates to beans but does show the cumulative benefit of additional seasons of a legume intercrop on the plot. Do 1AF farmers have fewer intercrops? This is hard to precisely pin down since the plots can shift from 1AF cultivation to not-1AF cultivation season to season. Therefore I’ll look at the average years of 1AF cultivation vs. legume intercrops
intercrop_check %>%
group_by(sample_id) %>%
summarize(years_oaf = sum(d_client, na.rm = T),
number_intercrops = sum(legume_intercrop, na.rm = T)) %>%
group_by(years_oaf, number_intercrops) %>%
tally() %>%
filter(years_oaf <= 3) %>%
filter(number_intercrops < 4) %>%
ggplot(., aes(x = as.factor(years_oaf), y = n, fill = as.factor(number_intercrops))) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = "Legume intercrops by 1AF tenure", x = "How many years a plot as been a 1AF plot", y = "Count of plots",
subtitle = "If no years of 1AF cultivation (control), many more years of intercropping\nThe 3 years category has fewer years of intercropping") +
theme(legend.position = 'bottom') +
theme_oaf()
We’re looking for an indication that 1AF plots have fewer legume intercrops and it does appear that plots that have not been cultivated with 1AF have more years of legume intercropping (the blue bar on the left) that for plots that are 1AF plots for more seasons. This relationship is not necessarily statistically different as we also see years of numerous legume intercrops for 1AF plots as well. The main distinguishing feature however is the big bar indicating that plots that have never been 1AF plots have had 3 legume intercrops more than 1AF plots.
In any one season, what is the likelihood that a 1AF plot is intercropped?
What percent of 1AF plots are intercropped in a given season?
intercrop_check %>%
group_by(season, d_client) %>%
summarize(intercropped = mean(legume_intercrop, na.rm = T)) %>%
spread(d_client, intercropped) %>%
rename(control = `0`,
oneacre = `1`) %>%
kable(.) %>%
kable_styling()
| season | control | oneacre |
|---|---|---|
| 2015 | 0.6588004 | 0.5173096 |
| 2016 | 0.6066863 | 0.5627198 |
| 2017 | 0.6078431 | 0.5637427 |
library(broom)
legume_model_data <- intercrop_check %>%
group_by(sample_id) %>%
summarize(years_oaf = sum(d_client, na.rm = T),
number_intercrops = sum(legume_intercrop, na.rm = T),
average_soil_nitrogen = mean(total.nitrogen, na.rm = T))
# plot these data to understand relationship.
legume_model_data %>%
filter(number_intercrops < 4) %>%
ggplot(., aes(x = as.factor(years_oaf), y = average_soil_nitrogen, fill = as.factor(number_intercrops))) +
geom_boxplot() +
theme_oaf() +
scale_y_continuous(breaks = seq(0.0, 2.5, 0.01)) +
labs(title = "Soil N by 1AF seasons and legume cultivations",
subtitle = "Fill is the number of times plot had a legume intercrop",
x = "Number of seasons plot was a 1AF plot",
y = "Average soil N",
fill = "Number of times plot had legume intercrop")
cat("R squard for simple model:", summary(lm(average_soil_nitrogen ~ as.factor(number_intercrops), data = legume_model_data))$r.squared)
## R squard for simple model: 0.0056116
cat("R squard for model including years_oaf:", summary(lm(average_soil_nitrogen ~ as.factor(number_intercrops) + years_oaf, data = legume_model_data))$r.squared)
## R squard for model including years_oaf: 0.005687464
The r2 of models looking at the simple relationship of years of legume cultivation and soil nitrogen is very low. That doesn’t seem to be capturing the movement of soil nitrogen.
intercrop_check %>%
filter(legume_intercrop == 1) %>%
group_by(intercrop, d_client) %>%
tally() %>%
ggplot(., aes(x = intercrop, y = n, fill = as.factor(d_client))) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = "Legume intercrop type by 1AF status",
subtitle = "It's mostly beans, folks and control plots are intercropped a lot more regularly",
x = "Intercrop type",
y = "Number of plots",
fill = "1AF status") +
theme_oaf()
Interpretation: About 400 more control plots were planted with a bean intercrop over the course of the study. However, we don’t see additional seasons of legume intercrop explaining soil N levels in the data.
Does increased intercropping explain the difference between 1AF and control plots? Do the other pieces of evidence line up to support this?
intercrop_check %>%
filter(legume_intercrop == 1) %>%
group_by(intercrop, d_client) %>%
tally() %>%
mutate(d_client = ifelse(d_client == 0, "control", "one-acre")) %>%
spread(d_client, n) %>%
kable(., caption = "Total plots by intercrop type") %>%
kable_styling()
| intercrop | control | one-acre |
|---|---|---|
| beans | 1835 | 1452 |
| groundnuts | 57 | 27 |
| soya | 15 | 6 |
Control plots were intercropped with beans about ~400 times more than 1AF plots which is not nothing.
intercrop_check %>%
filter(legume_intercrop == 1) %>%
group_by(intercrop, d_client) %>%
tally() %>%
ungroup() %>%
mutate(percent = paste(round(n / sum(n), 2) * 100, "%")) %>%
select(-n) %>%
mutate(d_client = ifelse(d_client == 0, "control", "one-acre")) %>%
spread(d_client, percent) %>%
kable(., caption = "Percent of plots by intercrop") %>%
kable_styling()
| intercrop | control | one-acre |
|---|---|---|
| beans | 54 % | 43 % |
| groundnuts | 2 % | 1 % |
| soya | 0 % | 0 % |
Let’s also check the monocrop data - first for frequency, second between legume cultivation and nitrogen.
keDat %>%
group_by(main.crop, d_client) %>%
tally() %>%
ggplot(., aes(x = main.crop, y = n, fill = as.factor(d_client))) +
geom_col(stat = 'identity', position = 'dodge') +
theme(axis.text.x = element_text(angle = 90)) +
labs(
title = "Main Crop Planted",
subtitle = "Almost no other crops planted but maize",
fill = "Client"
)
Basically, everyone was planting maize. Let’s filter out maize.
keDat %>%
filter(main.crop != "maize") %>%
group_by(main.crop, d_client) %>%
tally() %>%
ggplot(., aes(x = main.crop, y = n, fill = as.factor(d_client))) +
geom_col(stat = 'identity', position = 'dodge') +
theme(axis.text.x = element_text(angle = 90)) +
labs(
title = "Main Crop Planted (other than maize)",
subtitle = "Fallowing second most popular",
fill = "Client"
)
Interpretation: The farmers above represent just under 5% of the total sample. Control farmers do not seem disproportionately represented here, so it is unlikely this is driving the change.
keDat %>%
filter(!is.na(sample_id)) %>%
dplyr::select(d_client, sample_id, main.crop, season) %>%
mutate(main.crop = ifelse(!main.crop %in%
c("maize", "beans", "groundnuts", "fallow"),
"other",
ifelse(main.crop %in% c("beans", "groundnuts"),
"legume", main.crop))) %>%
spread(season, main.crop) %>%
mutate("15 - 16 Rotated?" = ifelse(`2015` != `2016`,
"Rotated", "Did Not Rotate")) %>%
mutate("16 - 17 Roated?" = ifelse(`2016` != `2017`,
"Rotated", "Did Not Rotate")) %>%
filter(!is.na(`15 - 16 Rotated?`)) %>%
group_by(`15 - 16 Rotated?`, d_client) %>%
tally() %>%
ggplot(., aes(x = `15 - 16 Rotated?`, y = n, fill = as.factor(d_client))) +
geom_col(position = 'dodge') +
labs(title = "2015 - 2016 Crop Rotation by 1AF Status",
subtitle =
"Control and 1AF farmers rotated plotsi n about the same frequency",
fill = "Client")
keDat %>%
filter(!is.na(sample_id)) %>%
dplyr::select(d_client, sample_id, main.crop, season) %>%
mutate(main.crop = ifelse(!main.crop %in%
c("maize", "beans", "groundnuts", "fallow"),
"other",
ifelse(main.crop %in% c("beans", "groundnuts"),
"legume", main.crop))) %>%
spread(season, main.crop) %>%
mutate("15 - 16 Rotated?" = ifelse(`2015` != `2016`,
"Rotated", "Did Not Rotate")) %>%
mutate("16 - 17 Rotated?" = ifelse(`2016` != `2017`,
"Rotated", "Did Not Rotate")) %>%
filter(!is.na(`16 - 17 Rotated?`)) %>%
group_by(`16 - 17 Rotated?`, d_client) %>%
tally() %>%
ggplot(., aes(x = `16 - 17 Rotated?`, y = n, fill = as.factor(d_client))) +
geom_col(position = 'dodge') +
labs(title = "2016 - 2017 Crop Rotation by 1AF Status",
subtitle =
"Control and 1AF farmers rotated plotsi n about the same frequency",
fill = "Client")
Interpretation: Not a lot of crop rotation happening.
Clean up region variable
keDat$region <- ifelse(grepl("west", tolower(keDat$region)), "Western", "Nyanza")
I want to create a metric of how different treatment values are from control values by district. I’m going to subtract the control average at the district level from each treatment value for each season. I’ll then plot them individually so it’s easier to see.
control_nitrogen_level <- function(year){
return(keDat %>%
filter(!is.na(district)) %>%
filter(season == year) %>%
group_by(district) %>%
mutate(control_nitrogen_average = mean(total.nitrogen[d_client == 0], na.rm = T)) %>%
ungroup() %>%
mutate(difference_nitrogen = total.nitrogen - control_nitrogen_average) %>%
filter(d_client == 1))
}
nitrogen_plots <- lapply(c(2015, 2016, 2017), function(year){
control_nitrogen_level(year)
})
lapply(nitrogen_plots, function(x){
ggplot(x, aes(x = district, y = difference_nitrogen, fill = region)) +
geom_boxplot() +
labs(title = paste("Difference in nitrogen ", unique(x$season)),
subtitle = "I don't see a clear spatial trend across locations or time") +
theme_oaf() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 45,
hjust = 1))
})
## [[1]]
##
## [[2]]
##
## [[3]]
Quickly, let’s output graphs for each district:
unique_districts <- unique(keDat$district)[!unique(keDat$district) %in% NA]
lapply(unique_districts, function(dist){
keDat %>%
filter(!is.na(district)) %>%
filter(district == dist) %>%
ggplot(., aes(x = season, y = total.nitrogen, fill = as.factor(d_client))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, .3)) +
labs(title = paste("Seasonal trend for", dist),
x = "Season",
y = "Total soil N (%)") +
theme_oaf()
})
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
When we look at this by district over time and 1AF plot status we definitely see a downward trend in some districts. We also see that in some cases, like Butere, the 1AF plots started well below controls. In ohters, like Suneka, we see the 1AF drop dramatically which is pretty difficult to explain. A couple other observations:
Farmers could be shifting plots with less N into 1AF cultivation and moving other plots out of 1AF rotation. Let’s check out how these trends look if we use the baseline assignment as our indicator for 1AF / control status.
When we look at the data based on the 2015 baseline treatment status, the difference between 1AF and control plots isn’t as extreme. The analytical model we’re using below should be taking into acocunt the effect of clients potentially switching their plot back and forth.
keDat %>%
ggplot(., aes(x = region, y = total.nitrogen)) +
geom_boxplot() +
facet_grid(~ season, scales = "fixed")
keDat %>%
filter(!is.na(district)) %>%
ggplot(., aes(x = district, y = total.nitrogen)) +
geom_boxplot() +
facet_grid(season ~ ., scales = "fixed") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))
This section will plot the changes in N on 1AF/control farmer plots spatially to determine if there are spatial trends to the decrease. Above we used boxplots for district; however, district lines are somewhat arbitrary so may obscure cross-district trends.
The trick here is to use what nitrogen to plot compared to control? If we are not aggregating this by district and looking at it spatially, then it doesn’t make sense to use the control average either.
First I will plot 2017 data Nitrogen levels.
# commcare GPS split
keDat <- cbind(keDat, commcare_gps_split(keDat$gps))
keDat17 <- keDat %>% filter(season == "2017")
# creating color palettes
totalNPals <- colorNumeric(palette = "Blues",
domain = unique(keDat17$total.nitrogen))
# getting the Kenya map
mapN17 <- leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addControl('All N Values - T/C 2017', position = "bottomleft") %>%
setView(lat=keDat17$lat[1], lng=keDat17$lon[1],zoom=7) %>%
addScaleBar() %>%
addCircles(data = keDat17, lng = keDat17$lon, lat = keDat17$lat,
radius = 1000, stroke = 0,
color = ~totalNPals(keDat17$total.nitrogen), fillOpacity = .8)%>%
addLegend(pal = totalNPals, values = keDat17$total.nitrogen,
title = "Total N")
mapN17
Now, let’s calculate the difference for all control plots within 5km. This is still somewhat arbitrary, but should give us more precision than using the district.
# lookup table [here](https://gis.stackexchange.com/questions/213971/units-for-radius-in-nn2-in-rann-package)
# 2 degress ~ 1.11 km, so 20 X 1.11 ~ 20kms
keDat17 <- keDat %>% filter(season == "2017")
# testing another way
keDat17mC <- keDat17 %>%
filter(d_client == 0) %>%
dplyr::select(lon, lat, total.nitrogen) %>%
filter(!is.na(lon) & !is.na(lat))
crdref <- CRS('+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0')
keDat17mC <-SpatialPointsDataFrame(coords = keDat17mC[,c("lon", "lat")],
data = keDat17mC, proj4string = crdref)
keDat17mT <- keDat17 %>%
filter(d_client == 1) %>%
dplyr::select(lon, lat, total.nitrogen) %>%
filter(!is.na(lon) & !is.na(lat))
keDat17mT <-SpatialPointsDataFrame(coords = keDat17mT[,c("lon", "lat")],
data = keDat17mT, proj4string = crdref)
res <- pointDistance(keDat17mT, keDat17mC, lonlat = TRUE,
allpairs = TRUE)
# turning into Km from m
res <- res / 1000
# calculating minimum average for all values
keDat17mC@data$CountingBool <- 1
keDat17mT@data$total.nitrogenNeightbor <- apply(res, 1, function(x)
mean(keDat17mC@data$total.nitrogen[x < 5], na.rm = TRUE)
)
keDat17mT@data$totalNeighbors5km <- apply(res, 1, function(x)
sum(keDat17mC@data$CountingBool[x < 5], na.rm = TRUE)
)
keDat17mT@data$totalNeighbors10km <- apply(res, 1, function(x)
sum(keDat17mC@data$CountingBool[x < 10], na.rm = TRUE)
)
keDat17mT@data$totalNeighbors20km <- apply(res, 1, function(x)
sum(keDat17mC@data$CountingBool[x < 20], na.rm = TRUE)
)
keDat17mT@data$n.difference = keDat17mT@data$total.nitrogen -
keDat17mT@data$total.nitrogenNeightbor
# creating color palettes
totalNDifference <- colorNumeric(palette = "Spectral",
domain = unique(keDat17mT$n.difference))
keDat17mTF10 <- keDat17mT[keDat17mT@data$totalNeighbors5km > 5,]
# getting the Kenya map
mapNDif2 <- leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addControl('T/C Plot N Difference', position = "bottomleft") %>%
setView(lat=keDat17mTF10$lat[1], lng=keDat17mTF10$lon[1],zoom=7) %>%
addScaleBar() %>%
addCircles(data = keDat17mTF10@data, lng = keDat17mTF10$lon,
lat = keDat17mTF10$lat,
radius = 2000, stroke = 0,
color = ~totalNDifference(keDat17mTF10$n.difference),
fillOpacity = .8)%>%
addLegend(pal = totalNDifference, values = keDat17mTF10$n.difference,
title = "T - C N")
mapNDif2
#saveWidget(mapNDif2, file="5kmRadius2017ControlTreatmentNDifference.html")
Interpretation: It looks like the overall difference between neighbors is not as substantial as the overall results show.
We can also check if this pattern is statistically significant by computing the spatial autocorrelation. I will weight values according values within 2kms 5kms, and 10kms of eachother. These will be used for the weights in spatial correlation using Moran’s I.
#http://www.bias-project.org.uk/ASDARcourse/unit6_slides.pdf
keDat17mTNoNas = subset(keDat17mT, !is.na(n.difference))
res2 <- pointDistance(keDat17mTNoNas, lonlat = TRUE,
allpairs = TRUE)
res2 <- res2 / 1000
res2_2 <- (res2 > 0 & res2 <= 2)
res2_5 <- (res2 > 0 & res2 <= 5)
res2_10 <- (res2 > 0 & res2 <= 10)
res2_2 <- as.matrix(Matrix::forceSymmetric(res2_2,uplo="L"))
res2_5 <- as.matrix(Matrix::forceSymmetric(res2_5,uplo="L"))
res2_10 <- as.matrix(Matrix::forceSymmetric(res2_10,uplo="L"))
distlw2 <- mat2listw(res2_2, style = "B")
distlw5 <- mat2listw(res2_5, style="B")
distlw10 <- mat2listw(res2_10, style="B")
moran.test(keDat17mTNoNas$n.difference, distlw2, zero.policy = TRUE)
##
## Moran I test under randomisation
##
## data: keDat17mTNoNas$n.difference
## weights: distlw2 n reduced by no-neighbour observations
##
##
## Moran I statistic standard deviate = 3.4057, p-value = 0.0003299
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.089334107 -0.001422475 0.000710131
moran.test(keDat17mTNoNas$n.difference, distlw5, zero.policy = TRUE)
##
## Moran I test under randomisation
##
## data: keDat17mTNoNas$n.difference
## weights: distlw5 n reduced by no-neighbour observations
##
##
## Moran I statistic standard deviate = 0.4399, p-value = 0.33
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0048380016 -0.0011834320 0.0001873654
moran.test(keDat17mTNoNas$n.difference, distlw10, zero.policy = TRUE)
##
## Moran I test under randomisation
##
## data: keDat17mTNoNas$n.difference
## weights: distlw10 n reduced by no-neighbour observations
##
##
## Moran I statistic standard deviate = 1.241, p-value = 0.1073
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 8.530650e-03 -1.173709e-03 6.115318e-05
The first test results shows the autocorrelation betwen N decrease relative to control plots in 1AF plots within 2km of each, the second 5km, and the third 10km. The null hypothesis is that there is spatial randomness.
At 2kms and below, there is evidence for some spatial autocorrelation between differences in N. However, about 18% of our data had no neighbors within 2kms and another 20% only had one neighbor. There is not evidence that this extends to larger geographic areas, like AEZs.
moran.plot(keDat17mTNoNas$n.difference,
distlw2, pch=19)
The plot above shows the line of best fit for the Moran autocorrelation at 2km and difference in N. In the upper-right hand quadrant, are values that have a positive correlation (neighbors and point are above average); in the lower-left quadrant are values with negative correlation. The other two quadrants are for values not correlated with neighbor values.
Let’s take a look at an example with no correlation.
moran.plot(keDat17mTNoNas$n.difference,
distlw10, pch=19)
The above plot shows the results of the 10km regression - the line of best fit is practically at 0.
We can also take a look at the local Moran values to try to see which points (in other words, which areas) are significantly correlated.
LM <- localmoran(keDat17mTNoNas$n.difference, distlw2,
zero.policy=TRUE, na.action=na.fail, p.adjust.method="none")
keDat17mTNoNas$LocalM_PValue <- LM[, 5]
# creating color palettes
pValPal <- colorNumeric(palette = "Spectral",
domain = unique(keDat17mTNoNas$LocalM_PValue))
# getting the Kenya map
mapLocalMPValues <- leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addControl('P-Value for Spatial Correlation', position = "bottomleft") %>%
setView(lat=keDat17mTNoNas$lat[1], lng=keDat17mTNoNas$lon[1],zoom=7) %>%
addScaleBar() %>%
addCircles(data = keDat17mTNoNas@data, lng = keDat17mTNoNas$lon,
lat = keDat17mTNoNas$lat,
radius = 2000, stroke = 0,
color = ~pValPal(keDat17mTNoNas$LocalM_PValue),
fillOpacity = .8)%>%
addLegend(pal = pValPal, values = keDat17mTNoNas$LocalM_PValue,
title = "P Values")
mapLocalMPValues
There does not seem to be as much correlation in Nyanza.
Now, let’s take a look at the same map, but trying to subset for field fertility.
We only have enough sample to look at “average fertility” plots, so these are the only plots I will consider.
keDat17mC_FC_Avg <- keDat17 %>%
filter(d_client == 0) %>%
filter(fertile.comparison == "average") %>%
dplyr::select(lon, lat, total.nitrogen) %>%
filter(!is.na(lon) & !is.na(lat)) %>%
SpatialPointsDataFrame(coords = .[,c("lon", "lat")],
data = ., proj4string = crdref)
keDat17mT_FC_Avg <- keDat17 %>%
filter(d_client == 1) %>%
filter(fertile.comparison == "average") %>%
dplyr::select(lon, lat, total.nitrogen) %>%
filter(!is.na(lon) & !is.na(lat)) %>%
SpatialPointsDataFrame(coords = .[,c("lon", "lat")],
data = ., proj4string = crdref)
## Split datasets by field fertility
res_Avg <- (pointDistance(keDat17mT_FC_Avg, keDat17mC_FC_Avg, lonlat = TRUE,
allpairs = TRUE)) / 1000
### Averaging Total Nitrogen/Getting Neighbor Counts
keDat17mC_FC_Avg@data$CountingBool <- 1
keDat17mT_FC_Avg@data$total.nitrogenNeightbor <- apply(res_Avg, 1, function(x)
mean(keDat17mC_FC_Avg@data$total.nitrogen[x < 5], na.rm = TRUE)
)
keDat17mT_FC_Avg@data$totalNeighbors5km <- apply(res_Avg, 1, function(x)
sum(keDat17mC_FC_Avg@data$CountingBool[x < 5], na.rm = TRUE)
)
Difference Between 5+ Average Plots within 5km Radius
keDat17mT_FC_Avg <- keDat17mT_FC_Avg[keDat17mT_FC_Avg@data$totalNeighbors5km > 5,]
keDat17mT_FC_Avg@data$n.difference = keDat17mT_FC_Avg@data$total.nitrogen -
keDat17mT_FC_Avg@data$total.nitrogenNeightbor
# getting the Kenya map
mapNDif_AvgPlots <- leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addControl('T/C Plot N Difference - Average Plots', position = "bottomleft") %>%
setView(lat=keDat17mT_FC_Avg$lat[1], lng=keDat17mT_FC_Avg$lon[1],zoom=7) %>%
addScaleBar() %>%
addCircles(data = keDat17mT_FC_Avg@data, lng = keDat17mT_FC_Avg$lon,
lat = keDat17mT_FC_Avg$lat,
radius = 2000, stroke = 0,
color = ~totalNDifference(keDat17mT_FC_Avg$n.difference),
fillOpacity = .8)%>%
addLegend(pal = totalNDifference, values = keDat17mT_FC_Avg$n.difference,
title = "T - C N")
mapNDif_AvgPlots
keDatBE <- keDat %>%
filter(d_client == 1) %>%
group_by(oafid,season) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count < 2) %>%
dplyr::select(oafid, season, total.nitrogen) %>%
spread(key = season, value = total.nitrogen) %>%
mutate(total.n.difference = `2017` - `2015`) %>%
filter(!is.na(total.n.difference)) %>%
filter(!is.na(`2016`))
keDatcoords <- keDat %>%
filter(d_client == 1) %>%
group_by(oafid,season) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count < 2) %>%
group_by(oafid) %>%
summarize(
lon = first(lon),
lat = first(lat)
)
keDatBE <- merge(keDatBE, keDatcoords, by = "oafid")
# getting the Kenya map
mapNDif3 <- leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addControl("Baseline Plot N Difference", position = "bottomleft") %>%
setView(lat=keDatBE$lat[1], lng=keDatBE$lon[1],zoom=7) %>%
addScaleBar() %>%
addCircles(data = keDatBE, lng = keDatBE$lon,
lat = keDatBE$lat,
radius = 2000, stroke = 0,
color = ~totalNDifference(keDatBE$total.n.difference),
fillOpacity = .8)%>%
addLegend(pal = totalNDifference, values = keDatBE$total.n.difference,
title = "15 - 17 N")
mapNDif3
#saveWidget(mapNDif3, file="Baseline2017NDifference.html")
These spatial correlations might be driven partially by soil texture. Let’s check TSN by soil type, and then see if there are any spatial patterns with soil texture.
soilTypevTSN <- lapply(unique_seasons, function(year){
return(
keDat %>%
group_by(sample_id) %>%
mutate(soil.Texture15 = first(soil.texture[season == "2015"])) %>%
ungroup() %>%
filter(season == year)
)
})
lapply(soilTypevTSN, function(x){
ggplot(x, aes(x = soil.Texture15, y = total.nitrogen,
fill = as.factor(d_client))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, .3)) +
geom_smooth(method = 'lm') +
labs(title = paste("Soil Texture versus TSN in", unique(x$season)),
subtitle = "",
x = "Soil Texture",
y = "Soil N",
fill = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
There are not huge differences here. In 2016/2017, sandy 1AF plots fared worse than sandy control plots.
It doesn’t look like we collected soil texture in 2017?? I am using 2015 data instead, which only used 3 types (clay / loam / sandy). The 2017 data had about 20 types, but with those numbers we have small sample sizes.
unique(keDat$soil.texture)
## [1] "loam" "sandy" "clay" "siltloam"
## [5] "clayloam" "siltyclayloam" "sandyclayloam" "siltyclay"
## [9] "sandyclay" NA "loamysand" "sandyloam"
## [13] "siltclay" "siltyloam" "sand" "loamy"
## [17] "sitlyclay" "siltclayloam" "slitloam" "sitlyclayloam"
## [21] "loamysandy"
Assess whether ratio of N application rates vs N uptake rates is different between treatment and control. We’d have to make some assumptions about
Quick histogram of farmer estimated yields.
ggplot(keDat, aes(x = yield.t.ha)) +
geom_histogram(binwidth = 0.25) +
scale_x_continuous(limit = c(0, 20)) +
labs(title = "Histogram of farmer est. yields (less than 20 t/ha)",
subtitle = "Esitmates are very noisy. 75th percentile is 15 t/ha")
#kable(quantile(keDat$yield.t.ha, na.rm = T), caption = "Most observations are less than 15 t/ha which is still really high")
These all seem very low?
ggplot(filter(keDat, yield < 2000),
aes(x = yield)) + facet_grid(~season, scales = "free") +
geom_density()
GGplot naturally selects the best scales. As we can see, the yield was between 0 - 100 (units not reported) in 2015 and 2016 - although 2016 is skewed way towards the lower side. 2016 seems bounded by 100.
It’s super clear that these are not are not the same units. Matt divided these by 100 to get tons per hecatre. I need to follow-up with M&E for the original 2015/2016/2017 datasets, because they aren’t in the folders I have.
applied_n_vs_yield <- lapply(unique_seasons, function(year){
return(
keDat %>%
filter(season == year) %>%
mutate(can_n = can.acre * 0.26,
dap_n = dap.acre * 0.18) %>%
rowwise() %>%
mutate(total_n_applied = sum(can_n, dap_n, na.rm = T)) %>%
mutate(total_n_applied_ha = total_n_applied * 2.47) %>%
filter(total_n_applied > 0) %>%
filter(yield.t.ha < 10)) %>%
as.data.frame()
})
lapply(applied_n_vs_yield, function(x){
ggplot(x, aes(x = total_n_applied_ha, y = yield.t.ha, color = as.factor(d_client))) +
geom_point() +
facet_wrap(~ season, scales = 'fixed') +
geom_smooth(method = 'lm') +
labs(title = paste("Applied N vs. farmer estimated yield", unique(x$season)),
subtitle = "Limiting farmer est. yield to < 10 t/ha - Trend lines are messy but suggest no difference",
x = "Total kg N / ha ",
y = "Yield t/ha",
fill = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
n_yields_ratio_v_soil <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year) %>%
mutate(can_n = can.acre * 0.26,
dap_n = dap.acre * 0.18) %>%
rowwise() %>%
mutate(total_n_applied = sum(can_n, dap_n, na.rm = T)) %>%
filter(total_n_applied > 0) %>%
mutate(n_yield_ratio = yield.t.ha / total_n_applied)) %>%
filter(n_yield_ratio < 500)
})
lapply(n_yields_ratio_v_soil, function(x) {
ggplot(x, aes(x = n_yield_ratio, y = total.nitrogen,
color = as.factor(d_client))) +
geom_point() +
ylim(0.05, .3) +
xlim(0, 500) +
geom_smooth(method = 'lm') +
labs(title = paste("Applied N > 0, Yield T/Ha:N < 500", unique(x$season)),
#subtitle = "Compost / acre less than 2000 kgs & > 0",
x = "Yield / Applied N",
y = "Total soil N (%))",
color = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Above plots show more outliers. Below I will only look at data where this relationship is less than 25, since most of our data falls in that range.
n_yields_ratio_v_soil2 <- lapply(unique_seasons, function(year) {
return(keDat %>%
filter(season == year) %>%
mutate(can_n = can.acre * 0.26,
dap_n = dap.acre * 0.18) %>%
rowwise() %>%
mutate(total_n_applied = sum(can_n, dap_n, na.rm = T)) %>%
filter(total_n_applied > 0) %>%
mutate(n_yield_ratio = yield.t.ha / total_n_applied)) %>%
filter(n_yield_ratio < 25)
})
lapply(n_yields_ratio_v_soil2, function(x){
ggplot(x, aes(x = n_yield_ratio, y = total.nitrogen,
color = as.factor(d_client))) +
geom_point() +
ylim(0.05, .3) +
xlim(0, 25) +
geom_smooth(method = 'lm') +
labs(title = paste("Applied N > 0, Yield:N < 25 in", unique(x$season)),
#subtitle = "Compost / acre less than 2000 kgs & > 0",
x = "Yield / Applied N",
y = "Total soil N (%))",
color = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interpretation: What the heck is happening in 2016?? But in 2015 and 2017 there is a steeper downward trend for clients than non-clients.
N applied as manure instead of plant material. Let’s look at compost by type by client status by season
keDat <- keDat %>%
mutate(compost_category = ifelse(grepl("pig|goat|cow|chicken", compost.type), "animal", "plant"),
compost.acre = ifelse(is.na(compost.acre), 0, compost.acre))
keDat %>%
group_by(compost_category) %>%
filter(compost.acre == 0) %>%
tally()
keDat %>%
filter(compost.acre < 2000) %>%
ggplot(., aes(x = as.factor(compost_category), y = compost.acre, fill = as.factor(d_client))) +
geom_boxplot() +
facet_wrap(~ season) +
labs(title = "Compost application by type including those that don't apply",
subtitle = "1AF plots are getting more animal manure when compost is applied",
x = "Compost type - animal or plant (plant residue / kitchen)",
y = "Total kg compost / acre",
fill = "Client")
Interpretation If anything this suggests that 1AF plots are getting more N rich compost. Unless I’m misunderstanding the consequence here, this should be contributing to N rates in the soil, not diminishing soil N.
This section will explore if any differences in N can be attributed to when the fields were converted from bush. This question was included in the 2019 survey.
Unfortunately, that means that we are missing some of this data for some farmers due to survey attrition.
fieldConversionDat <- read_csv("keFieldConversionDat.csv") %>%
dplyr::select(FieldConversion, sample_id)
keDat <- merge(keDat, fieldConversionDat, by = "sample_id",
all.x = TRUE)
fieldConversionvYields <- lapply(unique_seasons, function(year){
return(
keDat %>%
filter(season == year) %>%
mutate(FieldConversion = ifelse(
FieldConversion == "0_5_years",
"1. 0 - 5", FieldConversion
)) %>%
mutate(FieldConversion = ifelse(
FieldConversion == "11-15_years",
"3. 11 - 15", FieldConversion
)) %>%
mutate(FieldConversion = ifelse(
FieldConversion == "16_plus_years",
"4. 16 + ", FieldConversion
)) %>%
mutate(FieldConversion = ifelse(
FieldConversion == "6-10_years",
"2. 6 - 10", FieldConversion
)) %>%
as.data.frame()
)
})
lapply(fieldConversionvYields, function(x){
ggplot(x, aes(x = FieldConversion, y = total.nitrogen,
fill = as.factor(d_client))) +
geom_boxplot() +
facet_wrap(~ season, scales = 'fixed') +
geom_smooth(method = 'lm') +
labs(title = paste("N versus Field Conversion", unique(x$season)),
subtitle = "Grouped this way 1AF plots start out higher in 2015 and are lower in 2017",
x = "Field Conversion (as of 2019)",
y = "Soil N",
fill = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
In 2015, across all field conversion categories 1AF plots seemed to be doing as well or slightly better than their control counterparts; this trend is reversed by 2017 - espeically among plots converted in the last 6 - 10 years (as of 2019).
Lastly, this section will explore the relationship between hybrid seed use and n leaching. 1AF farmers are more likely than their control counterparts to use hybrid seed. In theory, we should have been able to see some of this effect while looking at yields, but our yield data is likely a bit noisy.
hybridSeedUsevN <- lapply(unique_seasons, function(year){
return(
keDat %>%
filter(season == year)
)
})
lapply(hybridSeedUsevN, function(x){
ggplot(x, aes(x = seed.type, y = total.nitrogen,
fill = as.factor(d_client))) +
geom_boxplot() +
facet_wrap(~ season, scales = 'fixed') +
geom_smooth(method = 'lm') +
labs(title = paste("N versus Field Conversion", unique(x$season)),
subtitle = "",
x = "Seed Type",
y = "Soil N",
fill = "Client")
})
## [[1]]
##
## [[2]]
##
## [[3]]
Interestingly, missing data on seed use and using local seeds had lower N than fields using hybrid seed.
Potentially 1AF farmers using local seeds are selecting their lowest fertility fields, or overcompensating with fertilizer?
keySoilVars <- c("ph", "calcium", "total.nitrogen",
"organic.carbon", "magnesium")
keDat <- keDat %>%
# cleaning age variable
mutate(age = ifelse(age > 90, 90, age)) %>%
mutate(age = ifelse(age < 18, 18, age)) %>%
mutate(age2 = age^2) %>%
group_by(sample_id) %>%
# TODO: remove this once I get the original 15/16 data
mutate(fertile.comparison.imputed =
first(fertile.comparison[season == "2017"])) %>%
ungroup()
indFeList <- list("as.factor(d_client)",
c("as.factor(d_client)", "as.factor(sample_id)"),
c("as.factor(d_client)", "as.factor(sample_id)",
"as.factor(season)"),
c("as.factor(d_client)",
#"as.factor(sample_id)",
"as.factor(season)",
"as.factor(fertile.comparison.imputed)"))
#c("as.factor(d_client)", "as.factor(sample_id)",
# "as.factor(season)", "age", "age2"))
# run this in parallel to speed up the process
# load the data and variables and packages into the cluster
regFileKe <- "regFile_through2017_updated.RData"
forceUpdate <- FALSE
if(!file.exists(regFileKe) || forceUpdate) {
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type="FORK")
clusterEvalQ(cl, "plm")
clusterExport(cl, "keDat")
clusterExport(cl, "keySoilVars")
clusterExport(cl, "indFeList")
indFeLoopKe <- parLapply(cl, indFeList, function(mod){
lapply(keySoilVars, function(outcome){
form = lm(reformulate(termlabels = mod, response = outcome),
data=keDat)
pdf(file=paste("updated_regs/",
paste0(outcome,
paste(mod, collapse = "")),
".pdf", sep = ""))
print(plot(form))
dev.off()
form = plm(form, c("sample_id"))
rownames(form) = paste(rownames(form), outcome, sep = " ")
return(form)
})
})
stopCluster(cl)
save(indFeLoopKe, file=regFileKe)
} else {
load(regFileKe)
}
And combine model outputs into tables for each model.
modExportKe <- lapply(indFeLoopKe, function(models){
do.call(rbind, models)
})
for(i in 1:length(modExportKe)){
write.csv(modExportKe[i],
file=paste0("updated_regs/","regOutput_", i, ".csv"), row.names = T)
}
In the individual fixed effect model above, the naive model would only include a client indicator and individual fixed effects. If we add season, we lose significance on almost everything. I’d guess that as we add more likely controls we additionally lose significance. I’ve included age and age squared along the lines of Hicks et.al.
finalModelKe <- modExportKe[4]
kable(finalModelKe, format="markdown")
| Coefficient | Std.Error | T | P | |
|---|---|---|---|---|
| (Intercept) ph | 5.8915326 | 0.0144865 | 406.6905831 | 0.0000000 |
| as.factor(d_client)1 ph | -0.0086401 | 0.0161710 | -0.5342945 | 0.5931734 |
| as.factor(season)2016 ph | -0.3898327 | 0.0158459 | -24.6014547 | 0.0000000 |
| as.factor(fertile.comparison.imputed)lessfertile ph | -0.0451456 | 0.0237381 | -1.9018189 | 0.0572809 |
| as.factor(fertile.comparison.imputed)morefertile ph | -0.0036578 | 0.0225004 | -0.1625637 | 0.8708718 |
| (Intercept) calcium | 1103.9541285 | 18.4120703 | 59.9581747 | 0.0000000 |
| as.factor(d_client)1 calcium | -26.4036361 | 17.8318839 | -1.4806981 | 0.1387480 |
| as.factor(season)2016 calcium | -79.3296154 | 21.8140116 | -3.6366358 | 0.0002789 |
| as.factor(season)2017 calcium | -37.2267199 | 21.2074379 | -1.7553615 | 0.0792567 |
| as.factor(fertile.comparison.imputed)lessfertile calcium | -81.1561909 | 26.2448833 | -3.0922672 | 0.0019969 |
| as.factor(fertile.comparison.imputed)morefertile calcium | -4.7545541 | 24.6867623 | -0.1925953 | 0.8472835 |
| (Intercept) total.nitrogen | 0.1594114 | 0.0010400 | 153.2768041 | 0.0000000 |
| as.factor(d_client)1 total.nitrogen | -0.0001576 | 0.0010053 | -0.1567223 | 0.8754698 |
| as.factor(season)2016 total.nitrogen | -0.0085993 | 0.0012279 | -7.0031548 | 0.0000000 |
| as.factor(season)2017 total.nitrogen | -0.0109652 | 0.0011988 | -9.1466583 | 0.0000000 |
| as.factor(fertile.comparison.imputed)lessfertile total.nitrogen | -0.0011206 | 0.0014806 | -0.7568210 | 0.4491914 |
| as.factor(fertile.comparison.imputed)morefertile total.nitrogen | -0.0055478 | 0.0013871 | -3.9995374 | 0.0000643 |
| (Intercept) organic.carbon | 2.1730153 | 0.0130681 | 166.2835424 | 0.0000000 |
| as.factor(d_client)1 organic.carbon | -0.0000425 | 0.0126390 | -0.0033602 | 0.9973190 |
| as.factor(season)2016 organic.carbon | -0.3191894 | 0.0154391 | -20.6741404 | 0.0000000 |
| as.factor(season)2017 organic.carbon | -0.3066667 | 0.0150638 | -20.3577957 | 0.0000000 |
| as.factor(fertile.comparison.imputed)lessfertile organic.carbon | -0.0134068 | 0.0186041 | -0.7206383 | 0.4711646 |
| as.factor(fertile.comparison.imputed)morefertile organic.carbon | -0.0083893 | 0.0174306 | -0.4812970 | 0.6303257 |
| (Intercept) magnesium | 187.6312690 | 2.8667754 | 65.4502866 | 0.0000000 |
| as.factor(d_client)1 magnesium | 1.6397472 | 3.2016911 | 0.5121504 | 0.6085798 |
| as.factor(season)2016 magnesium | -13.9049027 | 3.1370467 | -4.4324819 | 0.0000096 |
| as.factor(fertile.comparison.imputed)lessfertile magnesium | -11.7993745 | 4.6917246 | -2.5149333 | 0.0119523 |
| as.factor(fertile.comparison.imputed)morefertile magnesium | -9.1469363 | 4.4691113 | -2.0467014 | 0.0407658 |
write.csv(finalModelKe, file="updated_regs/indFe_ke2017.csv")
One enumerator in 2015 and 2016 was taking random samples but not actually visiting farms. This affected ~30 samples. There were 20 other farmers who were not able to be traced in the 2017 study, although no systematic fraud was confirmed. I will also remove these farmers.
ghostFarmers <- read_csv("GhostFarmerwSamples.csv")
keDatNoGhostFarmers <- keDat %>%
# cleaning age variable
filter(!(sample_id %in% ghostFarmers$soil_sample_id))
regFileKeNoGhostFarmers <- "regFile_through2017_updated_noGhostFarmers.RData"
if(!file.exists(regFileKeNoGhostFarmers) || forceUpdate) {
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type="FORK")
clusterEvalQ(cl, "plm")
clusterExport(cl, "keDat")
clusterExport(cl, "keySoilVars")
clusterExport(cl, "indFeList")
indFeLoopKeNoGhost <- parLapply(cl, indFeList, function(mod){
lapply(keySoilVars, function(outcome){
form = lm(reformulate(termlabels = mod, response = outcome),
data=keDatNoGhostFarmers)
pdf(file=paste("updated_regs/",
paste0(outcome, "_NoGhostFarmers_",
paste(mod, collapse = "")),
".pdf", sep = ""))
print(plot(form))
dev.off()
form = plm(form, c("sample_id"))
rownames(form) = paste(rownames(form), outcome, sep = " ")
return(form)
})
})
stopCluster(cl)
save(indFeLoopKeNoGhost, file=regFileKeNoGhostFarmers)
} else {
load(regFileKeNoGhostFarmers)
}
And combine model outputs into tables for each model.
modExportKeNoGhostFarmers <- lapply(indFeLoopKeNoGhost,
function(models){
do.call(rbind, models)
})
for(i in 1:length(modExportKeNoGhostFarmers)) {
write.csv(modExportKeNoGhostFarmers[i],
file=paste0("updated_regs/","regOutput_NoGhostFarmers", i, ".csv"), row.names = T)
}
In the individual fixed effect model above, the naive model would only include a client indicator and individual fixed effects. If we add season, we lose significance on almost everything. I’d guess that as we add more likely controls we additionally lose significance. I’ve included age and age squared along the lines of Hicks et.al.
finalModelKeNoGhostFarmers <- modExportKeNoGhostFarmers[4]
kable(finalModelKeNoGhostFarmers, format="markdown")
| Coefficient | Std.Error | T | P | |
|---|---|---|---|---|
| (Intercept) ph | 5.8884825 | 0.0144727 | 406.8686514 | 0.0000000 |
| as.factor(d_client)1 ph | -0.0067414 | 0.0161346 | -0.4178247 | 0.6761022 |
| as.factor(season)2016 ph | -0.3870087 | 0.0158164 | -24.4687888 | 0.0000000 |
| as.factor(fertile.comparison.imputed)lessfertile ph | -0.0483466 | 0.0237094 | -2.0391274 | 0.0415162 |
| as.factor(fertile.comparison.imputed)morefertile ph | -0.0031604 | 0.0224289 | -0.1409078 | 0.8879513 |
| (Intercept) calcium | 1101.2682333 | 18.4164276 | 59.7981465 | 0.0000000 |
| as.factor(d_client)1 calcium | -25.0101508 | 17.8167315 | -1.4037452 | 0.1604550 |
| as.factor(season)2016 calcium | -77.0407810 | 21.8047535 | -3.5332104 | 0.0004141 |
| as.factor(season)2017 calcium | -35.0283826 | 21.1967452 | -1.6525359 | 0.0984862 |
| as.factor(fertile.comparison.imputed)lessfertile calcium | -84.0539136 | 26.2491800 | -3.2021539 | 0.0013723 |
| as.factor(fertile.comparison.imputed)morefertile calcium | -4.4378601 | 24.6435430 | -0.1800821 | 0.8570952 |
| (Intercept) total.nitrogen | 0.1593762 | 0.0010433 | 152.7545958 | 0.0000000 |
| as.factor(d_client)1 total.nitrogen | -0.0001350 | 0.0010074 | -0.1340042 | 0.8934045 |
| as.factor(season)2016 total.nitrogen | -0.0085591 | 0.0012310 | -6.9529183 | 0.0000000 |
| as.factor(season)2017 total.nitrogen | -0.0109248 | 0.0012018 | -9.0906089 | 0.0000000 |
| as.factor(fertile.comparison.imputed)lessfertile total.nitrogen | -0.0012715 | 0.0014852 | -0.8560902 | 0.3919875 |
| as.factor(fertile.comparison.imputed)morefertile total.nitrogen | -0.0055539 | 0.0013888 | -3.9990573 | 0.0000645 |
| (Intercept) organic.carbon | 2.1728415 | 0.0131136 | 165.6931407 | 0.0000000 |
| as.factor(d_client)1 organic.carbon | 0.0000139 | 0.0126692 | 0.0010946 | 0.9991267 |
| as.factor(season)2016 organic.carbon | -0.3188357 | 0.0154825 | -20.5933092 | 0.0000000 |
| as.factor(season)2017 organic.carbon | -0.3064313 | 0.0151052 | -20.2865155 | 0.0000000 |
| as.factor(fertile.comparison.imputed)lessfertile organic.carbon | -0.0141307 | 0.0186674 | -0.7569721 | 0.4491010 |
| as.factor(fertile.comparison.imputed)morefertile organic.carbon | -0.0084489 | 0.0174566 | -0.4839912 | 0.6284126 |
| (Intercept) magnesium | 187.4604162 | 2.8725296 | 65.2596990 | 0.0000000 |
| as.factor(d_client)1 magnesium | 1.7444502 | 3.2039604 | 0.5444668 | 0.5861569 |
| as.factor(season)2016 magnesium | -13.6529246 | 3.1404858 | -4.3473925 | 0.0000142 |
| as.factor(fertile.comparison.imputed)lessfertile magnesium | -12.3761802 | 4.6999202 | -2.6332746 | 0.0084958 |
| as.factor(fertile.comparison.imputed)morefertile magnesium | -9.1639295 | 4.4680760 | -2.0509789 | 0.0403474 |
write.csv(finalModelKeNoGhostFarmers, file="updated_regs/indFe_ke2017_noGhostFarmers.csv")
Random effects for farm, calculate # of times someone has been a client. Will be added in once I get raw 2015/2015 data.
JS: not executing these for now.
The parallel model wasn’t running for some reason so I’m just going to run the one model I really want to look at and save those results.
rwDat <- rwDat %>% mutate(age2 = age^2)
rwDat <- do.call(data.frame,lapply(rwDat, function(x){
replace(x, is.infinite(x),NA)
}))
indFeList <- list("as.factor(d_client)",
c("as.factor(d_client)", "as.factor(sample_id)"),
c("as.factor(d_client)", "as.factor(sample_id)",
"as.factor(season)"),
c("as.factor(d_client)", "as.factor(sample_id)",
"as.factor(season)", "age", "age2"))
# run this in parallel to speed up the process
# load the data and variables and packages into the cluster
regFileRw <- "regFile_through2017b.RData"
#forceUpdate <- forceUpdateAll
if(!file.exists(regFileRw) || forceUpdate) {
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type="FORK")
clusterEvalQ(cl, "plm")
clusterExport(cl, "rwDat")
clusterExport(cl, "keySoilVars")
clusterExport(cl, "indFeList")
indFeLoopRw <- parLapply(cl, indFeList, function(mod){
lapply(keySoilVars, function(outcome){
form = lm(reformulate(termlabels = mod, response = outcome), data=rwDat)
pdf(file=paste("output/rw2017b/", paste0(outcome, paste(mod, collapse = "")), ".pdf", sep = ""))
print(plot(form))
dev.off()
form = plm(form, c("sample_id", "age", "age2"))
rownames(form) = paste(rownames(form), outcome, sep = " ")
return(form)
})
})
stopCluster(cl)
save(indFeLoopRw, file=regFileRw)
} else {
load(regFileRw)
}
modExportRw <- lapply(indFeLoopRw, function(models){
do.call(rbind, models)
})
for(i in 1:length(modExportRw)){
write.csv(modExportRw[i], file=paste0("output/rw2017b/","regOutput_", i, ".csv"), row.names = T)
}
finalModelRw <- modExportRw[4]
kable(finalModelRw, format="markdown")
write.csv(finalModelRw, file="output/rw2017b/indFe_rw2017.csv")
Nothing to see here